
    Live, Personal Data Integration Through UI-Oriented Computing

    This paper proposes a new perspective on the problem of data integration on the Web: that of the Surface Web. It introduces UI-oriented computing, a computing paradigm whose core ingredients are the user interfaces that make up the Surface Web, and shows how a sensible mapping of data integration tasks to user interface elements and user interactions can cope with data integration scenarios that so far have only been conceived for the Deep Web, with its APIs and Web services. The described approach provides a novel conceptual and technological framework for practices, such as the integration of data APIs/services and the extraction of content from Web pages, that are common but still not adequately supported. The approach targets programmers and users alike and comes as an extensible, open-source browser extension.

    Research directions in data wrangling: Visualizations and transformations for usable and credible data

    In spite of advances in technologies for working with data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process of ‘data wrangling’ often constitutes the most tedious and time-consuming aspect of analysis. Though data cleaning and integration are longstanding issues in the database community, relatively little research has explored how interactive visualization can advance the state of the art. In this article, we review the challenges and opportunities associated with addressing data quality issues. We argue that analysts might more effectively wrangle data through new interactive systems that integrate data verification, transformation, and visualization. We identify a number of outstanding research questions, including how appropriate visual encodings can facilitate apprehension of missing data, discrepant values, and uncertainty; how interactive visualizations might facilitate data transform specification; and how recorded provenance and social interaction might enable wider reuse, verification, and modification of data transformations.
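    Two of the diagnostic tasks the article highlights — spotting missing entries and discrepant values — can be illustrated with a minimal sketch (the column data, field name, and z-score threshold below are invented for illustration; real wrangling systems use far richer checks):

```python
# Hypothetical illustration of two data-quality diagnostics:
# flagging missing entries and statistical outliers in one column.

def diagnose(values, z_thresh=2.0):
    """Return indices of missing entries and of outliers beyond z_thresh."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    nums = [v for _, v in present]
    mean = sum(nums) / len(nums)
    std = (sum((v - mean) ** 2 for v in nums) / len(nums)) ** 0.5
    outliers = [i for i, v in present if std and abs(v - mean) / std > z_thresh]
    return missing, outliers

# A numeric column with two gaps and one discrepant reading
col = [10.2, 9.8, None, 10.5, 500.0, 10.1, None, 9.9]
missing, outliers = diagnose(col)
```

    An interactive system in the spirit of the article would render both index lists visually (e.g. highlighted cells) rather than returning them programmatically.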

    PROCLAIM: An Unsupervised Approach to Discover Domain-Specific Attribute Matchings from Heterogeneous Sources

    Schema matching is a critical problem in many applications where the main goal is to match attributes coming from heterogeneous sources. In this paper, we propose PROCLAIM (PROfile-based Cluster-Labeling for AttrIbute Matching), an automatic, unsupervised clustering-based approach to match attributes across a large number of heterogeneous sources. We define the concept of an attribute profile to characterize the main properties of an attribute using: (i) the statistical distribution and the dimension of the attribute's values, and (ii) the name and textual descriptions related to the attribute. The attribute matchings produced by PROCLAIM give the best representation of the heterogeneous sources thanks to the cluster-labeling function we define. We evaluate PROCLAIM on 45,000 different data sources from the Oil and Gas Authority open data website. The results we obtain are promising and validate our approach.
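    The attribute-profile idea can be sketched as follows — an assumed simplification in which a profile is just value statistics plus name tokens, whereas PROCLAIM's actual profiles and cluster-labeling function are considerably richer (all attribute names and values below are fabricated):

```python
# Sketch of profile-based attribute matching: build a profile from
# (i) value statistics and (ii) name tokens, then compare profiles.
import statistics

def profile(name, values):
    return {
        "tokens": set(name.lower().replace("_", " ").split()),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }

def similar(p, q, tol=0.25):
    """Match if the names share a token and the value scales are close."""
    name_overlap = bool(p["tokens"] & q["tokens"])
    scale = max(abs(p["mean"]), abs(q["mean"]), 1e-9)
    close = abs(p["mean"] - q["mean"]) / scale < tol
    return name_overlap and close

a = profile("well_depth", [1200.0, 1350.0, 1100.0])
b = profile("depth_m", [1180.0, 1290.0, 1400.0])
c = profile("api_gravity", [34.0, 36.5, 29.8])
```

    Clustering all profiles by such a similarity measure, then labeling each cluster, yields one matched attribute per cluster across the sources.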

    Fuzzy Semantic Labeling of Semi-structured Numerical Datasets

    SPARQL endpoints provide access to rich sources of data (e.g. knowledge graphs), which can be used to classify other, less structured datasets (e.g. CSV files or HTML tables on the Web). We propose an approach to suggest types for the numerical columns of a collection of input files available as CSVs. Our approach is based on applying the fuzzy c-means clustering technique to numerical data in the input files, using existing SPARQL endpoints to generate training datasets. Our approach has three major advantages: it works directly with live knowledge graphs, it does not require knowledge-graph profiling beforehand, and it avoids tedious and costly manual training to match values with types. We evaluate our approach against manually annotated datasets. The results show that the proposed approach classifies most of the types correctly for our test sets.
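    The clustering step the paper builds on can be shown with a minimal one-dimensional fuzzy c-means sketch (the actual system derives its training data from live SPARQL endpoints; the values below, standing in for e.g. "year" vs. "percentage" columns, are made up):

```python
# Minimal 1-D fuzzy c-means: soft memberships u[i, k] and
# membership-weighted center updates, iterated to convergence.
import numpy as np

def fuzzy_cmeans(x, c=2, m=2.0, iters=50):
    centers = np.linspace(x.min(), x.max(), c)  # deterministic init
    for _ in range(iters):
        d = np.abs(x[:, None] - centers[None, :]) + 1e-12
        # u[i, k] = 1 / sum_j (d[i, k] / d[i, j]) ** (2 / (m - 1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        # c[k] = sum_i u[i, k]**m * x[i] / sum_i u[i, k]**m
        centers = (u.T ** m @ x) / np.sum(u.T ** m, axis=1)
    return centers, u

x = np.array([1990.0, 2001.0, 2015.0, 3.5, 12.0, 47.5])
centers, u = fuzzy_cmeans(x)
```

    A new numerical column would then be assigned the type whose cluster center its values lie closest to, with the membership degree serving as a confidence score.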

    Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings

    Web tables constitute valuable sources of information for various applications, ranging from Web search to Knowledge Base (KB) augmentation. An underlying common requirement is to annotate the rows of Web tables with semantically rich descriptions of entities published in Web KBs. In this paper, we evaluate three unsupervised annotation methods: (a) a lookup-based method, which relies on the minimal entity context provided in Web tables to discover correspondences to the KB; (b) a semantic embeddings method, which exploits a vectorial representation of the rich entity context in a KB to identify the most relevant subset of entities in the Web table; and (c) an ontology matching method, which exploits schematic and instance information of entities available in both a KB and a Web table. Our experimental evaluation is conducted using two existing benchmark data sets in addition to a new large-scale benchmark created using Wikipedia tables. Our results show that: (1) our novel lookup-based method outperforms state-of-the-art lookup-based methods; (2) the semantic embeddings method outperforms lookup-based methods on one benchmark data set; and (3) the lack of a rich schema in Web tables can limit the ability of ontology matching tools to perform high-quality table annotation. As a result, we propose a hybrid method that significantly outperforms the individual methods on all the benchmarks.
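    The contrast between the first two strategies can be sketched in miniature — exact-label lookup against cosine similarity over entity vectors (the KB labels, URIs, and 3-dimensional "embeddings" below are fabricated for illustration; real systems use full KB indexes and high-dimensional embeddings):

```python
# Toy contrast: lookup-based vs. embedding-based entity annotation.
import math

kb_labels = {"berlin": "dbr:Berlin", "paris": "dbr:Paris"}
kb_vectors = {
    "dbr:Berlin": [0.9, 0.1, 0.2],
    "dbr:Paris": [0.8, 0.2, 0.1],
    "dbr:Paris_Hilton": [0.1, 0.9, 0.7],
}

def lookup(cell):
    """Exact-label lookup: brittle, but precise when it hits."""
    return kb_labels.get(cell.strip().lower())

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def embed_match(cell_vec):
    """Embedding match: nearest KB entity by cosine similarity."""
    return max(kb_vectors, key=lambda e: cosine(cell_vec, kb_vectors[e]))

uri = lookup("Berlin")                # exact-label hit
best = embed_match([0.8, 0.2, 0.1])   # nearest entity by cosine
```

    The lookup path fails on any surface-form variation, while the embedding path always returns *some* entity; a hybrid, as the paper proposes, can fall back from one to the other.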